A system for determining the body pose of a person from images

Timothy J Roberts, Stephen J McKenna and Ian W Ricketts 
Division of Applied Computing, University of Dundee 

9 April 2003 



Field of Invention 

A system is described that takes as input a live or pre-recorded image in which a
person is present and computes estimates of the position and orientation of each of the
person's major body parts that are visible in that image.

Background 

The system can be used to locate and estimate the pose of a person in a single 
monocular image. Alternatively, it can be used during tracking of the person in a 
sequence of images by combining it with a temporal prior propagated from other
images in the sequence. In this case, it allows tracking to reinitialise after partial or
full occlusion or after tracking of certain body parts fails temporarily for some other
reason. Alternatively, it can be used in a multi-camera system to estimate the person's 
pose from several views captured simultaneously. 

Many applications follow from this ability to determine body pose. The pose 
parameters determined can be used as control inputs to drive a computer game or 
some other motion-driven or gesture-driven human-computer interface. Alternatively,
the pose parameters can be used to control computer graphics, perhaps an avatar. The 
pose of a person can be used in the context of an art installation or a museum 
installation to enable the installation to respond interactively to the person's body 
movements. The detection and pose estimation of people in video images in particular 
can be used as part of automated monitoring and surveillance applications such as
security or care of the elderly. The system could be used as part of a markerless
motion-capture system, with all the applications that such systems have, such as
animation for entertainment and gait analysis. In particular, it could be used to
analyse golf swings.
The system could be used to analyse image/video archives or as part of an image 
indexing system. 

Summary of Invention 

The system has the following essential features: 


(i) Body part detectors, one for each body part such as an upper arm, a lower
leg or a head. These determine how likely it is that an individual body part
is present at a given position, orientation and scale in the image with a
particular shape and elongation.

(ii) A mechanism for scoring a hypothesised configuration of body parts as a
whole. A configuration may consist of a single part (e.g. a head), two parts
(e.g. an upper and lower arm), three parts (e.g. a hand, an upper arm and a
lower leg) or indeed any number of parts up to and including the entire
body. The configurations are scored in such a way that the scores of
configurations with different numbers of parts can be compared fairly. The
score takes into account self-occlusion. This and the ability to compare
configurations of different dimensionality are important features of the
invention. The score takes into account the outputs of the body part
detectors as well as other prior knowledge about symmetry (e.g. the lower
right arm is likely to have a similar colour to the lower left arm) and
visibility.

(iii) A population-based search routine which uses the outputs of the body part 
detectors to hypothesise configurations, score these configurations and 
search iteratively for probable configurations. Thereby, the search routine
is able to 'bootstrap' from the outputs of the individual body part detectors
to find configurations with more and more parts, ultimately finding a 
configuration comprised of all the parts that are clearly visible in the 
image. Information about the expected body pose can then be given as 
output from the system to control a particular application domain. 

The accompanying paper, which will be submitted for publication,
describes in more detail the background to the invention. It compares it with existing
work and it describes an embodiment of the invention.

It will be clear to those skilled in the art that modifications and alternatives can be 
practiced within the spirit of the invention. 

For example, 

- the use of histograms could be replaced by some other method of estimating a
frequency distribution (e.g. mixture models, Parzen windows) or feature
representation.

- different methods for comparing feature representations could be used (e.g.
chi-squared, histogram intersection).

- the part detectors could use other features (e.g. responses of local filters such
as gradient filters, Gaussian derivatives or Gabor functions)

- the parts could be parameterised to model perspective projection (as opposed 
to the affine model adopted here) 

- the search over configurations could incorporate any number of the widely 
known methods for high-dimensional search instead of or in combination with 
the methods mentioned above 

- the population-based search could use any number of heuristics to help 
bootstrap the search (e.g. background subtraction, skin colour or other prior 
appearance models, change/motion detection). 
Human Pose Estimation
from Real-World Images

Timothy J. Roberts, Stephen J. McKenna, Ian W. Ricketts
Department of Applied Computing
University of Dundee
Dundee, Scotland, DD1 4HN
troberts@computing.dundee.ac.uk
www.computing.dundee.ac.uk/staff/troberts

Abstract 


This paper presents a system for human pose estimation from real-world
images. Its success is due to the synergy of three novel components. Firstly,
a formulation is presented that models other object occlusion and thereby
allows the representation and comparison of partial (lower dimensional)
solutions. Secondly, a part detection scheme is described which, by using the
high-level shape model earlier in the detection process, is able to account for
texture. This results in more accurate part detection in real-world images.
Furthermore, the response is less sparse than the popular edge approach,
allowing more coarse sampling. Finally, a heuristic search scheme is discussed
which takes advantage of these properties by firstly identifying possible body
parts and then iteratively combining these partial solutions and predicting
larger configurations to achieve efficient pose estimation.

1 Introduction 

[...] from real-world images, such as those produced by human computer interfaces or video
indexing applications. This problem is made difficult primarily by the large variation
[...] self-occlusion, other object occlusion, [...] texture and clothing shape. To
account for this large and complex appearance it is popular to use a high-level, hand-built
[...] measurements to compute a score, a search is performed to
find the best solution. The success of this approach, in terms of its accuracy and
[...], [...] choice of model and its assumptions. [...]
detailed, efficient human pose estimation from real-world images requires a system with
components that work together synergistically. For example, [...]
cannot ignore the resulting estimation problem.
1.1 Outline

This paper describes three key, novel ideas. Firstly, a formulation is described that allows
the representation of general occlusion and the comparison of partial solutions. Secondly,
a part detection scheme is presented that incorporates the high-level shape model earlier in
the detection process, thus making part detection in real-world images easier and more
accurate. Thirdly, a two-stage heuristic search strategy is discussed that takes advantage
of the properties of the induced space by identifying parts and then iteratively combining
them into larger configurations without making assumptions about self-occlusion.

Background material is discussed throughout the paper where most relevant. Due to
the obvious similarities between pose estimation and tracking and since more work has
been performed on tracking, references are made to both bodies of literature.


2 Body Model 

A part-based representation is used to model the body shape. Each part has a set of
pose parameters, which are to be recovered. A particular pose is then described by the
parameters for each part, $p_i$, where $i$ labels the body part and $N = 14$.


2.1 Part Visibility 

A common assumption of current pose estimation and tracking systems is that all parts are
visible. However, since parts may be occluded by other objects or difficult to localise,
for example due to poor illumination, the system should be capable of representing a
pose with some parts occluded. In this system, occluded parts are not parameterised. A
consequence is that parts must be parameterised in their own co-ordinate system, rather
than hierarchically as is often the case in tracking systems, e.g. [15]. Whilst this increases
the dimensionality of the space, in practice an offset term is often required to model
complex joints like the shoulder [U], making the difference one of convenience. A pose
is then described by the set of visible parts only.

For this parameterisation to be useful, a method such as that described in §2.3 is
required to compare poses with different numbers of visible parts. A key advantage of
this approach is that it also allows the representation of partial (i.e. lower dimensional)
[...]. Firstly, some parts might be found more easily than others. For example, it is
often easier to locate parts that do not overlap. Secondly, because of inter-part linking,
configurations with small numbers of parts contain much of the overall pose information.
For example, knowing the position of just the head, hands and feet greatly constrains
the pose.

2.2 Part Shape 

Current systems often use geometric primitives, such as cylinders, to represent body parts
[...]. [...] a [...] representation of shape is learnt from manually segmented
and aligned training images. Although each part could be represented using a
distribution of masks, in the current implementation each part is represented using a
single mask. A point in such a mask, $m_i(x, y)$, represents the probability of [...]

Figure 1: The learnt structure masks with major axis rotation marginalised.

[...] found by marginalising over [...] shape variations in
[...] perspective effects, a variation of the scaled prismatic
model is used to parameterise the [...] appearance. This reduces dimensionality
[...] and removes kinematic singularities [...]. In this case, rotation
about the major axis is also marginalised since it is difficult to observe. These
approximations are further justified due to the insensitivity of [...]
to exact boundary position. The structure masks used throughout the paper are [...]

In this implementation, each part has a centre, $(x_c, y_c)$, an image plane rotation
$\theta$ and a foreshortening parameter $e$. Every part also shares a common scale
parameter [...]. To account for self-occlusion, depth ordering is represented. The pose
space is then denoted by the depth ordered set $P = \{p_i\}$ where
$p_i = (x_{c,i}, y_{c,i}, \theta_i, e_i)$.

2.3 Part Appearance 

A number of methods for body part detection have been proposed, although in the opinion of
the authors much work remains to be done. Matching geometric primitives to
[...] the distribution of foreground and background local filter responses. Ronfard et al. [...]
[...] boundary. This is particularly important in the case of [...].
Furthermore, [...] local filter responses to a part boundary [...]
a sparse cue which necessitates dense sampling. Another popular method, [...],
has the [...] of requiring knowledge of [...].
[...]-level texture [...] immediate appeal but requires
further [...] and no one-to-one correspondence exists between regions and parts.
In order to account for texture and give a less sparse response, [...] the
high-level shape model earlier in the part detection process. [...]
[...] local edge response [...] texture segmentation.

Figure 2: Illustration of the body model. The masks are transformed into image
coordinates to collect the foreground and background appearance of the hypothesised pose.

Arguments are now presented on what information is most important
in formulating a part detector that is (i) feasible to learn and (ii) quick to evaluate. Firstly,
it is assumed that neither the part foregrounds nor the adjoining backgrounds are known.
Next, it is argued that in general the spatial structure of the foreground and background is
sufficiently unconstrained that it contains little discriminatory information. Whilst this is
an obvious approximation, the computational cost of recovering this information is
prohibitive. For example, describing the wide variation of layouts of clothing on a human
torso would require a high dimensional model and large amounts of training data. The
remaining information is present in the differences between the internal and external regions
and the similarity of the internal regions.

To collect the foreground and background appearances for a hypothesised pose the
structure mask for each visible part is projected into the image as shown in Figure 2. When
taken together the masks determine the probability that a particular image feature belongs
to a particular part or the background. A single point may contribute to multiple part
appearances and the background appearance. More specifically, the probability that a feature
$f(x, y)$ belongs to part $i$ is $w_i(x, y) = m_i(x, y) \times \prod_j (1 - m_j(x, y))$ where
$j$ labels closer, visible parts. By extension, the background probability is
$b(x, y) = \prod_k (1 - m_k(x, y))$ where $k$ labels all visible parts.
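
The weighting just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the masks are taken to be already projected into image coordinates as equally sized 2-D lists of probabilities, ordered nearest part first, and the function name `part_weights` is hypothetical rather than taken from the paper.

```python
def part_weights(masks_near_to_far):
    """Occlusion-aware weights for depth-ordered part masks.

    masks_near_to_far: non-empty list of 2-D lists holding m_i(x, y),
    nearest part first (assumed layout, not specified by the paper).
    Returns (weights, background): weights[i][y][x] is w_i(x, y) and
    background[y][x] is b(x, y).
    """
    rows = len(masks_near_to_far[0])
    cols = len(masks_near_to_far[0][0])
    # Running product of (1 - m_j) over the parts processed so far,
    # i.e. over all parts closer than the current one.
    not_covered = [[1.0] * cols for _ in range(rows)]
    weights = []
    for mask in masks_near_to_far:
        weights.append([[mask[y][x] * not_covered[y][x] for x in range(cols)]
                        for y in range(rows)])
        not_covered = [[not_covered[y][x] * (1.0 - mask[y][x]) for x in range(cols)]
                       for y in range(rows)]
    # After the last part the running product covers all visible parts.
    return weights, not_covered
```

With this form the part weights and the background weight sum to one at every pixel, which is why a single point can contribute to several appearances without being counted more than once in total.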

The foreground appearance of each part is described using a single distribution. In
this particular implementation, a feature, $f(x, y)$, is the intensity and chromaticity at the
pixel, resulting in a limited model of texture. Standard histograms are used to represent
the possibly multi-modal distributions. Histograms for each part, $H_i$, are formed by adding
features proportional to their weight $w_i(x, y)$.
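
As a rough illustration of how such a weighted histogram might be accumulated, the sketch below assumes that each pixel's feature has already been quantised to an integer bin index and that the histogram is normalised to sum to one. Neither the quantisation nor the normalisation is specified in the text, and the name `weighted_histogram` is hypothetical.

```python
def weighted_histogram(feature_bins, weights, num_bins):
    """Accumulate a foreground histogram for one part.

    feature_bins: bin index of the feature at each pixel (assumed pre-quantised).
    weights: the weight w_i(x, y) of the same pixels, in the same order.
    """
    hist = [0.0] * num_bins
    for bin_index, weight in zip(feature_bins, weights):
        # Each feature contributes in proportion to its weight.
        hist[bin_index] += weight
    total = sum(hist)
    if total > 0.0:
        # Normalisation is an assumption made so histograms are comparable.
        hist = [h / total for h in hist]
    return hist
```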

It is proposed that the detection of a particular part should depend only upon regions
close to that part. Therefore, background features are collected for each part and only
from the region surrounding that part. A single feature can be shared between the
background of multiple parts. In early implementations the background appearance was also
represented using a feature distribution, but this required a large region of support that
may not be present. Therefore, the background for a particular part is represented by a set
of weighted features, $B_i = \{(b(x, y), f(x, y))\}$, where $(x, y)$ are close to the mask.
2.4 Part Detection

A method is required that allows the comparison of configurations with different numbers
of visible parts. To accomplish this using a probabilistic formulation would require the
[...] high dimensional integration which is clearly not feasible.
Ioffe and Forsyth [7] discuss some of the difficulties of probabilistic modelling of
multiple part systems. Rather than insisting upon probabilistic modelling, unnormalised
response strength is used instead, with response $R > 1$ signifying the presence of a
configuration and larger response indicating greater support for the configuration. Our
formulation naturally rewards higher dimensional configurations. The response is composed
of a boundary response, a symmetry response and a link term:

$$R = \prod_{\text{parts}} R^{B}_{i} \times \prod_{\text{symmetric pairs}} R^{S}_{ij} \times \prod_{\text{neighbours}} L_{ij}$$
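
Reading the overall response as a product of the three kinds of term (an assumption, although one consistent with $R > 1$ signifying presence and with larger configurations being rewarded), the comparison of configurations of different size can be sketched as follows. The function name and argument layout are hypothetical.

```python
from math import prod


def configuration_response(boundary, symmetry, links):
    """Combine per-term responses into one unnormalised score.

    boundary: boundary responses, one per visible part.
    symmetry: symmetry responses, one per visible symmetric pair.
    links: link terms, one per pair of neighbouring visible parts.
    The product form is an assumed reading of the text.
    """
    # math.prod of an empty list is 1, so absent terms leave the score unchanged.
    return prod(boundary) * prod(symmetry) * prod(links)
```

Because every well-supported term is greater than one, adding a well-supported part raises the score, so a one-part and a three-part configuration can be ranked on the same scale without any normalisation.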

2.4.1 Boundary Response

The boundary response for a part is computed from the difference between its foreground
and background appearances formed from the high-level shape model. More specifically,
the response is a function of the probability of each border point not being foreground:

[...]

The optimal form of this function is the ratio of the probability densities of a part
given the response, to the probability of background given the response [13]. These
probability densities are represented non-parametrically and learnt for each part from training
images. In this implementation, to reduce the dimensionality of the distribution,
background points are grouped together by proximity into segments. Segments are specified
manually to account for the tendency for some segments to be stronger than others and
reduce the bias for certain degrees of freedom in elongated shapes. The response of a
segment is the average of the response of its points and segment responses are combined
in an ad hoc fashion by assuming independence:

$$R^{B}_{i} = \prod_{s \in \text{segments}} \frac{p(\text{part} \mid r_s)}{p(\text{background} \mid r_s)}$$

2.4.2 Symmetry Response 

If it can be assumed that the subject has body parts for which the foreground appearances
are approximately equal, an additional term can be included to reduce the number
of false [...] and increase accuracy. This is denoted by [...].
Histograms are compared using the Bhattacharyya measure,
$BH(H_i, H_j) = \sum_f \sqrt{H_i(f) \times H_j(f)}$. The Bhattacharyya measure for symmetric pair
configurations and part background configurations is learnt in a similar way to the border
response.
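
For reference, the Bhattacharyya measure between two histograms can be computed as in the short sketch below. It assumes both histograms are normalised to sum to one and share the same bins; the function name is illustrative only.

```python
from math import sqrt


def bhattacharyya(h_i, h_j):
    """Bhattacharyya measure between two normalised histograms.

    Returns a value in [0, 1]: 1 for identical distributions,
    0 when the histograms share no occupied bin.
    """
    return sum(sqrt(a * b) for a, b in zip(h_i, h_j))
```

For example, comparing the histograms of a lower left arm and a lower right arm that are dressed alike would be expected to give a value close to one.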

2.4.3 Link Term

[...] a link term is included. Each part has a set of control points
that link it to its neighbouring parts. A link is introduced between each neighbouring part
pair and takes the value $L_{ij} = 1$ if the distance between the control points of the pair,
$\delta_{ij}$, is less than the un-penalised distance, and $L_{ij} = [\ldots]$ otherwise. If
neighbouring parts do not link directly, because intervening parts are not parameterised, the
un-penalised distance is found by summing the un-penalised distance over the complete
[...]

[...] can be interpreted as a [...] between parts equivalent to a [...]
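
A minimal sketch of such a link term is given below. The exact penalty applied beyond the un-penalised distance is not recoverable from the text, so an exponential decay with a hypothetical falloff parameter `sigma` is used purely as a stand-in; only the value of 1 inside the un-penalised distance and the summing of distances over missing parts come from the description.

```python
from math import exp


def link_term(distance, unpenalised, sigma=1.0):
    """Link term for one pair of neighbouring parts.

    distance: distance between the pair's control points.
    unpenalised: distance below which no penalty applies.
    sigma: hypothetical falloff scale (an assumption, not from the paper).
    """
    if distance < unpenalised:
        return 1.0
    # Assumed penalty shape: decays smoothly from 1 as the gap grows.
    return exp(-(distance - unpenalised) / sigma)


def chained_unpenalised(distances_along_chain):
    """Un-penalised distance across intervening, unparameterised parts."""
    return sum(distances_along_chain)
```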

3 Pose Estimation 

One can identify two distinct pose estimation strategies in the literature, one used primarily
for single images and the other for tracking. Single image pose estimation schemes
often use a two-stage approach, e.g. [7, 12]. First, candidate body parts are detected in
the image. Grouping is then performed, for example by dynamic programming, to identify
compatible parts and the best possible solution. Whilst this strategy is essentially
global, the feed-forward nature does not address the problem of partial occlusion. Frame
by frame trackers on the other hand proceed by localised sampling in the full dimensional
pose space, e.g. [6], around estimates given by a motion model, e.g. [10]. The approach
usually requires manual initialisation and does not recover from significant [...] the
motion model. However, by sampling in the full dimensional space, it can model self
occlusion. Tracking systems often propose sophisticated search [...], e.g. [...].
[...] application of [...] genetic search techniques to a [...]
space [...]. In the opinion of the authors this makes little sense as one cannot
extract meaningful sub-solutions from the state (the motivation for genetic search) since
[...] recovering an unbiased density estimate [9]. The key to efficient global optimisation
is in the use of heuristics that capture the important properties of the space. [...]
a two phase approach is proposed to take advantage of the formulation described above.

3.1 Population Phase 

[...] a 'population' phase is used to seed the
search. This could be accomplished using a range of domain specific heuristics including
[...] estimated foreground appearance (e.g. skin colour), estimated background
appearance, [...], or combinations thereof. In this particular implementation, due to
the broad nature of the response, it is possible to perform coarse [...] sampling over an
[...]. The spacing of the sampling is determined manually on a part by part
basis. For example, the head is sampled horizontally and vertically every 5 pixels (for an
image resolution of 640 x 480) and at 8 orientations. Samples with a response larger than a
threshold are added to the population.
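
The coarse sampling for a single part can be sketched as below. The defaults follow the head example (5-pixel spacing, 640 x 480 image, 8 orientations); spreading the orientations evenly over a full turn, using a threshold of 1 and the `response` callback signature are assumptions made for illustration.

```python
from math import pi


def population_phase(response, width=640, height=480, step=5,
                     n_orient=8, threshold=1.0):
    """Seed the population for one part by coarse regular sampling.

    response: callable (x, y, theta) -> part response (hypothetical signature).
    Returns a list of (response, (x, y, theta)) for samples above threshold.
    """
    population = []
    for y in range(0, height, step):
        for x in range(0, width, step):
            for k in range(n_orient):
                # Orientations evenly spaced over a full turn (assumed).
                theta = 2.0 * pi * k / n_orient
                r = response(x, y, theta)
                if r > threshold:
                    population.append((r, (x, y, theta)))
    return population
```

With the default settings this evaluates the detector 128 x 96 x 8 times per part, which is only practical because the response is broad enough to survive such coarse spacing.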

3.2 Building Phase 

Next, a 'building' phase is repeated for a fixed length of time to find larger configurations.
For each iteration a new generation, of fixed size, is created and evaluated. The best
assembly is preserved by adding it to each new generation (the elitist approach). In this
particular implementation the building phase consists of a prediction step, a cross over
step and a local search step.
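
The outer loop of the building phase might look like the following sketch. The callbacks `make_child` and `response` are hypothetical placeholders for the steps and the scoring function of the paper, and counting the elite as one member of the fixed-size generation is an assumption.

```python
import time


def building_phase(population, make_child, response, generation_size, time_budget_s):
    """Run generations for a fixed length of time, keeping the elite.

    population: non-empty list of assemblies.
    make_child: callable (population) -> new assembly (hypothetical).
    response: callable (assembly) -> score.
    At least one generation is always produced.
    """
    deadline = time.monotonic() + time_budget_s
    while True:
        elite = max(population, key=response)
        # Elitism: the best assembly is carried into the new generation unchanged.
        children = [make_child(population) for _ in range(generation_size - 1)]
        population = [elite] + children
        if time.monotonic() >= deadline:
            break
    return max(population, key=response)
```

Because the elite is always retained, the best response in the population can never decrease from one generation to the next.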

In the prediction step, new samples are created by adding parts that are currently not
parameterised into the set. This is necessary when the population phase failed to find
the part, for example due to partial occlusion. Currently, this is performed in an ad-hoc
fashion by adding a single part to the assembly for which the closest neighbour is currently
parameterised. Parts are instantiated using the minimum control point distance, $\delta$, to the
closest neighbour, with an extension sampled from a Gaussian distribution (mean equal
to 0.75, standard deviation 0.15), aligned with the closest neighbour and at the lowest
depth. Currently, this step produces 25% of the new population.

In the cross over step, new samples are formed from the previous samples by combining
body parts. The novelty is the improvement in which parts are sampled. Firstly, an
assembly is randomly sampled from the previous population proportional to its response
$R$. Then a single part is sampled from that assembly proportional to its border response
[...]. This form represents its fitness independent of the complete assembly. A complete
assembly is produced by iterating this process, drawing parts from the set of remaining
parts in sampled assemblies, producing an assembly of size $N_G$. Currently, $N_G$ is chosen
in an ad-hoc fashion as a sample from a Gaussian with a mean equal to the size of the
current best assembly (elite) plus 1 and standard deviation of [...]. This step produces the
remaining 75% of the new population.
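
The two-level sampling of the cross over step can be sketched as follows. The data layout (an assembly as a dictionary from part name to a `(border_response, parameters)` pair), the function names, the standard deviation of 1 and the clamping of the target size are all assumptions; responses are assumed to be positive so they can serve as sampling weights.

```python
import random


def cross_over(population, target_size, rng):
    """Build a child assembly from parts of previous assemblies.

    population: list of (response, assembly) pairs.
    assembly: dict part_name -> (border_response, params) (assumed layout).
    """
    child = {}
    while len(child) < target_size:
        # Only assemblies that can still contribute a missing part are eligible.
        candidates = [(r, a) for r, a in population
                      if any(name not in child for name in a)]
        if not candidates:
            break
        # Sample an assembly in proportion to its overall response.
        _, assembly = rng.choices(candidates,
                                  weights=[r for r, _ in candidates], k=1)[0]
        remaining = [(n, p) for n, p in assembly.items() if n not in child]
        # Sample one remaining part in proportion to its border response.
        name, part = rng.choices(remaining,
                                 weights=[p[0] for _, p in remaining], k=1)[0]
        child[name] = part
    return child


def sample_target_size(elite_size, rng, sigma=1.0, max_parts=14):
    """Draw the child size around elite size plus one (sigma is assumed)."""
    size = round(rng.gauss(elite_size + 1, sigma))
    return max(1, min(max_parts, size))
```

Sampling parts by their own border response, rather than by the response of the assembly they came from, lets a well-localised part survive even when the rest of its assembly was poor.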

In the local search step, each member of the resulting population is improved by
making local samples for each part in its assembly. More specifically, a part is sampled
from the assembly and a small random perturbation, determined on a part by part basis,
is made to the configuration. This perturbation includes a change in depth order. The
new configuration replaces the old if it has a larger response. This step is iterated until no
improvement is made for a fixed number of iterations (currently 20).
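
The local search amounts to a simple stochastic hill climb with a patience counter, as in the sketch below. Here an assembly is simplified to a list of per-part parameters, the `response` and `perturb` callbacks are hypothetical, and the change in depth order is left out for brevity.

```python
import random


def local_search(assembly, response, perturb, rng, patience=20):
    """Improve one assembly by random local perturbation of single parts.

    assembly: non-empty list of per-part parameters (simplified layout).
    response: callable (assembly) -> score.
    perturb: callable (part, rng) -> perturbed part (hypothetical).
    Stops after `patience` consecutive samples without improvement.
    """
    best = list(assembly)
    best_response = response(best)
    stale = 0
    while stale < patience:
        index = rng.randrange(len(best))
        candidate = list(best)
        candidate[index] = perturb(best[index], rng)
        candidate_response = response(candidate)
        if candidate_response > best_response:
            # Keep the new configuration only if its response is larger.
            best, best_response, stale = candidate, candidate_response, 0
        else:
            stale += 1
    return best, best_response
```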

4 Summary 

A system was presented that allowed detailed, efficient estimation of human pose from
real-world images. The improvements were due to the synergy of three novel components:
(i) a model of other object occlusion that also allowed the representation and comparison
of partial (lower dimensional) solutions, (ii) a novel part detector that, by making use of
high-level shape knowledge earlier in the detection process, is able to account for texture
and thereby find body parts more easily and accurately, (iii) a search scheme that takes
advantage of these properties, firstly by identifying body parts and then by iteratively
combining and predicting larger configurations, without making assumptions about self
occlusion, to achieve efficient pose estimation.